Introduction to Phylogenies and the Comparative Method

Showing some neat features of R!
Author
Affiliation
Published

June 5, 2024

Note

In this exercise, you will learn basic tools in R for visualizing phylogenies, testing models of character evolution, performing phylogenetic correction of a regression model, and test for the phylogenetic signal of continuous characters. This lab is based in part on one designed by Luke Harmon for a workshop that he and others ran at Ilha Bela, Brazil; the original can be seen here There are many other useful labs in comparative analysis from that workshop that you can peruse at your leisure.

You will need two datasets, that will be provided for you:

  1. A data.frame with species and mean spectra values – NIMBioS_spectra_sample.csv

  2. A phylogenetic tree – NIMBioS_phylogeny.nex

1 Clone the repository

You can clone the repository using Git. Mac and Linux users have Git installed natively in their computers, but if you are a Windows user, you might need to install Git first (https://git-scm.com/download/win)

To clone the repository, you just need (example for Mac) to follow the next steps:

  1. Open terminal

  2. Type “pwd” without quotes. This means Print Working Directory.

  3. Type “cd” without quotes followed by the folder you want to work on, for example: cd Documents

  4. Type the following command

4.1. git clone https://github.com/jesusNPL/MLM_intro.git

This command will clone the full repository in the folder you selected. That’s it!

2 Set up your data and your working directory

For this lab, you will need to have a set of R packages to do this lab. Install the following packages:

Code
# Package vector names 
packages <- c("tidyverse", "ape", "geiger", "ggnewscale", "knitr", 
              "phytools", "neonUtilities", "devtools") 
Function install.packages()

You can use the function install.packages() to install the packages.

If you don’t want to install the packages one by one, you can use the next command.

Code
# Install packages not yet installed 
# get packages already installed
installed_packages <- packages %in% rownames(installed.packages())

# If the packages are installed skip if not install them
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages], dependencies = TRUE)
}

This command, will, first, check if you already the packages installed, then if a package is not installed in your computer, will install it.

We will also need the R package {V.PhyloMaker} to obtain a phylogenetic tree for our plant species. Note that this package is not located in CRAN, so we will install the package directly from the author’s github.

Code
if ( ! ("V.PhyloMaker" %in% installed.packages())) {remotes::install_github("jinyizju/V.PhyloMaker")}

Load installed packages:

Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(ape)

Attaching package: 'ape'

The following object is masked from 'package:dplyr':

    where
Code
library(geiger)
Loading required package: phytools
Loading required package: maps

Attaching package: 'maps'

The following object is masked from 'package:purrr':

    map
Code
library(phytools)
library(neonUtilities)
library(V.PhyloMaker)

Set up a working directory. Tell R that this is the directory you will be using, and read in your data:

Function getwd()

You can use the function getwd() to get the current working directory.

Code
setwd("path/for/your/directory")
Function dir.create()

You can use the function dir.create() to get create a series of folders within your working directory. For example, if you run dir.create(“Output”) it will create an empty folder–named Output–within your working directory. This folder then can be used to store the results from the lab.

3 Part 1 - Obtain a phylogenetic tree

3.1 Prepare data

3.1.1 Download plant inventories data from NEON

We will download data of plant inventories directly from NEON and to do that we will use the R package {neonUtilities} and the function ‘loadByProduct’. Before proceed let’s take a look at the core information required in this function. To do that you just can type ?loadByProduct in your console and the documentation for this function will appear in the Help window of RStudio.

The core information we need to inform in the ‘loadByProduct’ function are:

  1. dpID = The identifier of the NEON data product to pull, in the form DPL.PRNUM.REV, e.g. DP1.10023.001.

  2. site = Either the string ‘all’, meaning all available sites, or a character vector of 4-letter NEON site codes, e.g. c(‘ONAQ’,‘RMNP’). Defaults to all.

  3. package = Either ‘basic’ or ‘expanded’, indicating which data package to download. Defaults to basic.

As you can see, the dpID and site correspond to the kind of data we want to download and the site to the location where the data were collected, respectively. We can open the NEON website to find the information required for downloading the data for plant presence and percent cover. Also, you can look at the map of NEON sites to see the distribution of NEON sites across The United States.

For this practice we will use the next information: dpID = “DP1.10058.001” and site = c(“HARV”) that correspond to plant presence and percent cover and the site: HARV (Harvard Forest & Quabbin Watershed NEON, MA). Note that you can download data only for all sites but for the sake of getting more practice in data management in R we will download the full data of plant community data for one site (you can download data for more sites if you prefer).

Code
# Set global option to NOT convert all character variables to factors
options(stringsAsFactors = FALSE)

NEON_data <- loadByProduct(dpID = "DP1.10058.001", 
                        site = "HARV", 
                        package = "expanded", 
                        check.size = TRUE)

# type "y" (with no quotes) in your console to start downloading the data from NEON

Let’s inspect the downloaded data.

Code
load("../data/NEON/RawData_NEON_HARV.RData")

names(NEON_data)
 [1] "categoricalCodes_10058"      "citation_10058_RELEASE-2024"
 [3] "div_10m2Data100m2Data"       "div_1m2Data"                
 [5] "div_geneticarchive"          "div_morphospecies"          
 [7] "div_voucher"                 "issueLog_10058"             
 [9] "readme_10058"                "validation_10058"           
[11] "variables_10058"            
Code
#View(NEON_data$div_10m2Data100m2Data)

#View(NEON_data$div_1m2Data)

Save the raw data in your hard drive - this is a common practice that allow reproducibility and also save you a lot of time.

Code
dir.create("../data/NEON")

save(NEON_data, file = "../data/NEON/RawData_NEON_HARV.RData")

The data of abundance of plants correspond to the object “div_1m2Data”, thus let’s isolate that data from the rawdata.

Code
sel <- NEON_data$div_1m2Data %>% 
  select(namedLocation, domainID, siteID, plotType, plotID, subplotID, endDate, 
         taxonID, taxonRank, family, scientificName, nativeStatusCode, 
         percentCover, heightPlantSpecies)

unique(sel$namedLocation)
 [1] "HARV_027.basePlot.div" "HARV_006.basePlot.div" "HARV_013.basePlot.div"
 [4] "HARV_023.basePlot.div" "HARV_001.basePlot.div" "HARV_010.basePlot.div"
 [7] "HARV_004.basePlot.div" "HARV_002.basePlot.div" "HARV_012.basePlot.div"
[10] "HARV_030.basePlot.div" "HARV_014.basePlot.div" "HARV_031.basePlot.div"
[13] "HARV_059.basePlot.div" "HARV_005.basePlot.div" "HARV_022.basePlot.div"
[16] "HARV_058.basePlot.div" "HARV_011.basePlot.div" "HARV_033.basePlot.div"
[19] "HARV_008.basePlot.div" "HARV_026.basePlot.div" "HARV_018.basePlot.div"
[22] "HARV_021.basePlot.div" "HARV_028.basePlot.div" "HARV_029.basePlot.div"
[25] "HARV_025.basePlot.div" "HARV_024.basePlot.div" "HARV_015.basePlot.div"
[28] "HARV_016.basePlot.div" "HARV_017.basePlot.div" "HARV_020.basePlot.div"
[31] "HARV_063.basePlot.div" "HARV_035.basePlot.div" "HARV_034.basePlot.div"
Code
unique(sel$siteID)
[1] "HARV"
Code
unique(sel$endDate)
  [1] "2014-06-02 GMT" "2014-06-04 GMT" "2014-06-05 GMT" "2014-06-06 GMT"
  [5] "2014-06-10 GMT" "2014-06-11 GMT" "2014-06-12 GMT" "2014-06-13 GMT"
  [9] "2014-06-17 GMT" "2014-06-19 GMT" "2014-06-20 GMT" "2014-06-23 GMT"
 [13] "2014-06-24 GMT" "2014-06-26 GMT" "2014-06-27 GMT" "2014-06-30 GMT"
 [17] "2014-07-01 GMT" "2014-07-02 GMT" "2014-07-03 GMT" "2014-07-07 GMT"
 [21] "2014-07-08 GMT" "2014-07-09 GMT" "2014-07-10 GMT" "2015-06-01 GMT"
 [25] "2015-06-02 GMT" "2015-06-03 GMT" "2015-06-04 GMT" "2015-06-08 GMT"
 [29] "2015-06-09 GMT" "2015-06-10 GMT" "2015-06-15 GMT" "2015-06-16 GMT"
 [33] "2015-06-17 GMT" "2015-06-18 GMT" "2015-06-22 GMT" "2015-06-23 GMT"
 [37] "2016-03-09 GMT" "2016-05-31 GMT" "2016-06-01 GMT" "2016-06-02 GMT"
 [41] "2016-06-03 GMT" "2016-06-06 GMT" "2016-06-07 GMT" "2016-06-08 GMT"
 [45] "2016-06-09 GMT" "2016-06-13 GMT" "2016-06-15 GMT" "2016-06-14 GMT"
 [49] "2016-06-16 GMT" "2016-06-20 GMT" "2016-06-21 GMT" "2016-06-23 GMT"
 [53] "2016-06-27 GMT" "2017-06-01 GMT" "2017-06-02 GMT" "2017-06-07 GMT"
 [57] "2017-06-08 GMT" "2017-06-12 GMT" "2017-06-14 GMT" "2017-06-15 GMT"
 [61] "2017-06-20 GMT" "2017-06-21 GMT" "2017-06-26 GMT" "2017-06-27 GMT"
 [65] "2017-06-28 GMT" "2017-06-29 GMT" "2017-07-03 GMT" "2017-07-04 GMT"
 [69] "2017-07-05 GMT" "2017-07-06 GMT" "2017-07-10 GMT" "2018-06-18 GMT"
 [73] "2018-06-19 GMT" "2018-06-20 GMT" "2018-06-21 GMT" "2018-06-25 GMT"
 [77] "2018-06-26 GMT" "2018-06-27 GMT" "2018-06-29 GMT" "2018-07-03 GMT"
 [81] "2018-07-06 GMT" "2018-07-11 GMT" "2018-07-12 GMT" "2018-07-18 GMT"
 [85] "2018-07-19 GMT" "2018-07-23 GMT" "2018-07-26 GMT" "2018-07-30 GMT"
 [89] "2018-07-31 GMT" "2019-06-17 GMT" "2019-06-18 GMT" "2019-06-19 GMT"
 [93] "2019-06-20 GMT" "2019-06-24 GMT" "2019-06-25 GMT" "2019-06-26 GMT"
 [97] "2019-06-27 GMT" "2019-07-01 GMT" "2019-07-02 GMT" "2019-07-03 GMT"
[101] "2020-07-01 GMT" "2020-07-02 GMT" "2020-07-03 GMT" "2020-07-04 GMT"
[105] "2020-07-05 GMT" "2020-07-06 GMT" "2020-07-07 GMT" "2020-07-08 GMT"
[109] "2020-07-09 GMT" "2020-07-11 GMT" "2020-07-12 GMT" "2020-07-13 GMT"
[113] "2020-07-15 GMT" "2020-07-16 GMT" "2020-07-17 GMT" "2020-07-18 GMT"
[117] "2020-07-19 GMT" "2020-07-20 GMT" "2020-07-21 GMT" "2020-07-22 GMT"
[121] "2020-07-23 GMT" "2020-07-24 GMT" "2020-07-25 GMT" "2020-07-26 GMT"
[125] "2020-07-27 GMT" "2020-07-28 GMT" "2020-07-29 GMT" "2020-07-31 GMT"
[129] "2020-08-01 GMT" "2020-08-02 GMT" "2020-08-03 GMT" "2020-08-06 GMT"
[133] "2020-08-07 GMT" "2020-08-08 GMT" "2020-08-10 GMT" "2020-08-11 GMT"
[137] "2020-08-12 GMT" "2020-08-14 GMT" "2021-06-01 GMT" "2021-06-14 GMT"
[141] "2021-06-21 GMT" "2021-06-22 GMT" "2021-06-23 GMT" "2021-06-24 GMT"
[145] "2021-06-28 GMT" "2021-06-29 GMT" "2021-06-30 GMT" "2021-07-01 GMT"
[149] "2021-07-07 GMT" "2021-07-08 GMT" "2021-07-12 GMT" "2022-06-02 GMT"
[153] "2022-06-06 GMT" "2022-06-07 GMT" "2022-06-08 GMT" "2022-06-13 GMT"
[157] "2022-06-14 GMT" "2022-06-15 GMT" "2022-06-20 GMT" "2022-06-21 GMT"
[161] "2022-06-22 GMT" "2022-06-23 GMT" "2022-06-27 GMT" "2022-06-28 GMT"
[165] "2022-06-29 GMT" "2022-06-30 GMT" "2022-07-05 GMT" "2022-07-06 GMT"
[169] "2022-07-07 GMT" "2022-07-08 GMT" "2022-07-11 GMT" "2022-07-12 GMT"
[173] "2022-07-14 GMT" "2022-07-25 GMT" "2022-07-26 GMT" "2022-09-08 GMT"

The isolated data is a data.frame of 86,813 rows and 14 columns. Some information is not required so let’s clean the data a little bit and select information for only one site and a single period of time.

Code
sel <- sel %>% 
  drop_na(scientificName) %>% # Removing NAs in the column of species
  mutate(Date = endDate) %>% 
  separate(endDate, sep = "-", 
           into = c("Year", "Month", "Day"))

unique(sel$Year)
[1] "2014" "2015" "2016" "2017" "2018" "2019" "2020" "2021" "2022"
Code
unique(sel$siteID)
[1] "HARV"

Select the site HARV and the year 2022.

Code
HARV <- sel %>% 
  filter(Year == 2022)

unique(HARV$Year)
[1] "2022"
Code
unique(HARV$siteID)
[1] "HARV"
Code
head(HARV, 10)
           namedLocation domainID siteID plotType   plotID subplotID Year Month
1  HARV_034.basePlot.div      D01   HARV    tower HARV_034    31_1_1 2022    06
2  HARV_034.basePlot.div      D01   HARV    tower HARV_034    31_1_1 2022    06
3  HARV_034.basePlot.div      D01   HARV    tower HARV_034    31_1_1 2022    06
4  HARV_034.basePlot.div      D01   HARV    tower HARV_034    40_1_1 2022    06
5  HARV_034.basePlot.div      D01   HARV    tower HARV_034    32_1_2 2022    06
6  HARV_034.basePlot.div      D01   HARV    tower HARV_034    41_1_4 2022    06
7  HARV_033.basePlot.div      D01   HARV    tower HARV_033    40_1_1 2022    06
8  HARV_033.basePlot.div      D01   HARV    tower HARV_033    40_1_3 2022    06
9  HARV_034.basePlot.div      D01   HARV    tower HARV_034    32_1_2 2022    06
10 HARV_033.basePlot.div      D01   HARV    tower HARV_033    40_1_1 2022    06
   Day taxonID  taxonRank        family                         scientificName
1   02    BELE    species    Betulaceae                        Betula lenta L.
2   02    VAAN    species     Ericaceae          Vaccinium angustifolium Aiton
3   02   TRBOB subspecies   Primulaceae Trientalis borealis Raf. ssp. borealis
4   06    TSCA    species      Pinaceae         Tsuga canadensis (L.) Carrière
5   06   ARNU2    species    Araliaceae                   Aralia nudicaulis L.
6   06   COTR2    species Ranunculaceae           Coptis trifolia (L.) Salisb.
7   06   ARNU2    species    Araliaceae                   Aralia nudicaulis L.
8   06    QURU    species      Fagaceae                       Quercus rubra L.
9   06    BELE    species    Betulaceae                        Betula lenta L.
10  06    MEVI    species     Liliaceae                  Medeola virginiana L.
   nativeStatusCode percentCover heightPlantSpecies       Date
1                 N          0.5                 NA 2022-06-02
2                 N          0.5                 NA 2022-06-02
3                 N          0.5                 NA 2022-06-02
4                 N          8.0                 NA 2022-06-06
5                 N         13.0                 NA 2022-06-06
6                 N          1.0                 NA 2022-06-06
7                 N          2.0                 NA 2022-06-06
8                 N          0.5                 NA 2022-06-06
9                 N          0.5                 NA 2022-06-06
10                N          5.0                 NA 2022-06-06

3.2 Check taxonomy

The data corresponding to NEON site HARV for the 2022 contains 1220 rows or observations and 17 columns or variables. If we look at the data, specifically to the column scientificName we can see that the taxonomy used in NEON correspond to the taxonomy used by the USDA, however this taxonomy is not necessarily used by scholars (🤔😵🫣). We generally rely for example on the taxonomy of the APG Angiosperm Phylogeny Group for Angiosperms and WCSP World Checklist of Conifers for Gymnosperms. More recently the initiative Plants of the World was launched and as today is the most reliable source to check the taxonomic names of plant species.

Code
glimpse(HARV)
Rows: 1,220
Columns: 17
$ namedLocation      <chr> "HARV_034.basePlot.div", "HARV_034.basePlot.div", "…
$ domainID           <chr> "D01", "D01", "D01", "D01", "D01", "D01", "D01", "D…
$ siteID             <chr> "HARV", "HARV", "HARV", "HARV", "HARV", "HARV", "HA…
$ plotType           <chr> "tower", "tower", "tower", "tower", "tower", "tower…
$ plotID             <chr> "HARV_034", "HARV_034", "HARV_034", "HARV_034", "HA…
$ subplotID          <chr> "31_1_1", "31_1_1", "31_1_1", "40_1_1", "32_1_2", "…
$ Year               <chr> "2022", "2022", "2022", "2022", "2022", "2022", "20…
$ Month              <chr> "06", "06", "06", "06", "06", "06", "06", "06", "06…
$ Day                <chr> "02", "02", "02", "06", "06", "06", "06", "06", "06…
$ taxonID            <chr> "BELE", "VAAN", "TRBOB", "TSCA", "ARNU2", "COTR2", …
$ taxonRank          <chr> "species", "species", "subspecies", "species", "spe…
$ family             <chr> "Betulaceae", "Ericaceae", "Primulaceae", "Pinaceae…
$ scientificName     <chr> "Betula lenta L.", "Vaccinium angustifolium Aiton",…
$ nativeStatusCode   <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "…
$ percentCover       <dbl> 0.5, 0.5, 0.5, 8.0, 13.0, 1.0, 2.0, 0.5, 0.5, 5.0, …
$ heightPlantSpecies <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Date               <dttm> 2022-06-02, 2022-06-02, 2022-06-02, 2022-06-06, 20…

In any case, we need to standardize the species names in order to proceed with the exercise.

To standardize the species names, you can use the steps described HERE.

However, for simplicity we will use the Taxonomic Name Resolution Service (TNRS) tool, that basically is a web interface that used the Plants of the World as a base source.

The only thing we need to do is to copy the scientific names and paste on the TNRS tool.

Code
# vector with scientific names
spp <- HARV %>% 
  select(scientificName) %>% 
  distinct()

# print the species names
head(spp, 10)
                           scientificName
1                         Betula lenta L.
2           Vaccinium angustifolium Aiton
3  Trientalis borealis Raf. ssp. borealis
4          Tsuga canadensis (L.) Carrière
5                    Aralia nudicaulis L.
6            Coptis trifolia (L.) Salisb.
7                        Quercus rubra L.
8                   Medeola virginiana L.
9                     Mitchella repens L.
10              Trillium undulatum Willd.
Code
# save names and open with excel or numbers
write_csv(spp, 
          file = "../data/NEON/HARV_spp_names_2022.csv")

The next step is to open the file HARV_spp_names_2022.csv. Then, copy the species names and paste them into the TRNL website. Next, click on the submit tab and then click on the Download Data tab. A small window will pop out (see image below), and click in DOWNLOAD.

TRNL

A .CSV file named tnrs_result.csv will be downloaded to your Downloads folder. Please copy the file and paste it into the folder data/NEON of this tutorial.

Now we will upload the standardized species names through the TRNL tool and match these names with the ones in the species composition dataset or HARV data.

Code
spp_check <- read_csv("../data/NEON/tnrs_result.csv")
Rows: 199 Columns: 48
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (27): Name_submitted, Name_matched, Name_matched_rank, Author_submitted,...
dbl (13): ID, Overall_score, Name_matched_id, Name_score, Author_score, Genu...
lgl  (8): Infraspecific_rank_2, Infraspecific_epithet_2_matched, Infraspecif...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
head(spp_check, 10)
# A tibble: 10 × 48
      ID Name_submitted    Overall_score Name_matched_id Name_matched Name_score
   <dbl> <chr>                     <dbl>           <dbl> <chr>             <dbl>
 1     1 Betula lenta L.           1              329014 Betula lenta      1    
 2     1 Betula lenta L.           0.8            329003 Betula lenta      1    
 3     2 Vaccinium angust…         1              417184 Vaccinium a…      1    
 4     2 Vaccinium angust…         0.8            417185 Vaccinium a…      1    
 5     3 Trientalis borea…         0.960         1090934 Trientalis …      0.960
 6     3 Trientalis borea…         0.960         1090932 Trientalis …      0.960
 7     4 Tsuga canadensis…         0.976          450324 Tsuga canad…      1    
 8     5 Aralia nudicauli…         1              253544 Aralia nudi…      1    
 9     5 Aralia nudicauli…         0.8            253535 Aralia nudi…      1    
10     6 Coptis trifolia …         0.905          612855 Coptis trif…      1    
# ℹ 42 more variables: Name_matched_rank <chr>, Author_submitted <chr>,
#   Author_matched <chr>, Author_score <dbl>, Canonical_author <chr>,
#   Name_matched_accepted_family <chr>, Genus_submitted <chr>,
#   Genus_matched <chr>, Genus_score <dbl>, Specific_epithet_submitted <chr>,
#   Specific_epithet_matched <chr>, Specific_epithet_score <dbl>,
#   Family_submitted <chr>, Family_matched <chr>, Family_score <dbl>,
#   Infraspecific_rank <chr>, Infraspecific_epithet_matched <chr>, …

Select the necessary information and combine it with the HARV data.

Code
taxonomy <- spp_check %>% 
  select(Name_submitted, Accepted_family, Genus_matched, 
         Specific_epithet_matched, Accepted_species) %>% 
  mutate(Specific_epithet_matched = if_else(is.na(Specific_epithet_matched), 
                                            "sp.", Specific_epithet_matched)) %>% 
  mutate(species = paste(Genus_matched, Specific_epithet_matched)) %>% 
  rename(genus = Genus_matched) %>% 
  drop_na(Accepted_family, genus) %>% 
  distinct(species, .keep_all = TRUE)

taxonomy$Name_submitted[4] <- "Tsuga canadensis (L.) Carrière"
taxonomy$Name_submitted[16] <- "Ilex mucronata (L.) Powell, Savolainen & Andrews"
#taxonomy$Name_submitted[55] <- "Lysimachia terrestris (L.) Britton, Sterns & Poggenb."
taxonomy$Name_submitted[77] <- "Lycopodium hickeyi W.H. Wagner, Beitel & Moran"

Now let’s combine the new taxonomic information with our NEON data.

Code
HARV_data <- right_join(x = HARV, 
                       y = taxonomy, 
                       by = c("scientificName" = "Name_submitted"))

We can select some specific columns that we will use from now on.

Code
HARV_data <- HARV_data %>% 
  select(siteID, plotID, subplotID, Accepted_family, genus, species, percentCover) %>% 
  drop_na(species)

head(HARV_data)
  siteID   plotID subplotID Accepted_family      genus                 species
1   HARV HARV_034    31_1_1      Betulaceae     Betula            Betula lenta
2   HARV HARV_034    31_1_1       Ericaceae  Vaccinium Vaccinium angustifolium
3   HARV HARV_034    31_1_1     Primulaceae Trientalis     Trientalis borealis
4   HARV HARV_034    40_1_1        Pinaceae      Tsuga        Tsuga canadensis
5   HARV HARV_034    32_1_2      Araliaceae     Aralia       Aralia nudicaulis
6   HARV HARV_034    41_1_4   Ranunculaceae     Coptis         Coptis trifolia
  percentCover
1          0.5
2          0.5
3          0.5
4          8.0
5         13.0
6          1.0

3.3 Get phylogenetic hypothesis

Alright, until here we have downloaded, cleaned and prepared plant community data for the NEON site HARV. The next step is to prepare the phylogeny for those communities or a community level phylogeny. To do that we will use the most up to date phylogeny of vascular plants Constructing a broadly inclusive seed plant phylogeny and the R package {V.PhyloMaker}.

3.3.1 Get taxonomic table

Code
# Prepare the taxonomy data to extract the phylogeny
sppPhylo <- HARV_data %>% 
  distinct(Accepted_family, genus, species) %>% # identify unique values
  select(species, genus, family = Accepted_family)  # select columns
  
head(sppPhylo, 10)
                   species      genus        family
1             Betula lenta     Betula    Betulaceae
2  Vaccinium angustifolium  Vaccinium     Ericaceae
3      Trientalis borealis Trientalis   Primulaceae
4         Tsuga canadensis      Tsuga      Pinaceae
5        Aralia nudicaulis     Aralia    Araliaceae
6          Coptis trifolia     Coptis Ranunculaceae
7            Quercus rubra    Quercus      Fagaceae
8       Medeola virginiana    Medeola     Liliaceae
9         Mitchella repens  Mitchella     Rubiaceae
10      Trillium undulatum   Trillium Melanthiaceae

3.3.2 Get phylogeny

Prepare the phylogeny and plot it.

Code
result <- V.PhyloMaker::phylo.maker(sppPhylo, 
                                    scenarios = "S3") # this will take some time.
[1] "Taxonomic classification not consistent between sp.list and tree."
      genus family_in_sp.list family_in_tree
7  Athyrium      Aspleniaceae    Athyriaceae
86 Viburnum       Viburnaceae      Adoxaceae

Explore the results

See which species were added to the megaphylogeny

Code
glimpse(result$species.list)
Rows: 136
Columns: 4
$ species <chr> "Betula lenta", "Vaccinium angustifolium", "Trientalis boreali…
$ genus   <chr> "Betula", "Vaccinium", "Trientalis", "Tsuga", "Aralia", "Copti…
$ family  <chr> "Betulaceae", "Ericaceae", "Primulaceae", "Pinaceae", "Araliac…
$ status  <chr> "prune", "bind", "prune", "prune", "prune", "prune", "prune", …

Explore the phylogeny

Code
str(result$scenario.3)
List of 5
 $ edge       : int [1:260, 1:2] 137 138 139 140 141 142 143 144 145 146 ...
 $ edge.length: num [1:260] 10.09 65.65 189.29 4.08 7.95 ...
 $ Nnode      : int 125
 $ node.label : chr [1:125] "" "" "Spermatophyta" "mrcaott2ott121" ...
 $ tip.label  : chr [1:136] "Bidens_frondosa" "Eurybia_divaricata" "Symphyotrichum_lanceolatum" "Solidago_rugosa" ...
 - attr(*, "class")= chr "phylo"
 - attr(*, "order")= chr "cladewise"

Double check the results

Code
phylo <- multi2di(result$scenario.3)

# Check if our phylogeny is ultrametric
is.ultrametric(phylo)
[1] TRUE
Code
# Check is our phylogeny is bifurcated 
is.binary.phylo(phylo)
[1] TRUE
Code
plot(phylo, show.tip.label = FALSE)

axisPhylo()

Save results

Code
dir.create("../output/Phylogeny")

# Save taxonomy
write_csv(result$species.list, 
          file = "../output/Phylogeny/HARV_taxonomic_constraint.csv")

# Save phylogeny
write.nexus(result$scenario.3, 
            file = "../output/Phylogeny/HARV_phylogeny.nex")

Clean environment

Code
rm(list = ls())

4 Part 2 - Intro to phylogenies and the comparative method

Load the data. Instead of reading files from the computer we will pull the required data directly from the internet.

Code
## Trait data
nimbiosData <- read_csv("../data/NIMBios_spectra_sample.csv") %>% 
  column_to_rownames("species")
Rows: 136 Columns: 202
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (1): species
dbl (201): 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
## Phylogenetic data
nimbiosTree <- read.nexus("../data/NIMBioS_phylogeny.nex")
The pipe (%>%) operator

This operator is, maybe, the most used operator from the {dplyr} package and is used to perform a sequence of operations on a data frame. In other words, the pipe operator simply feeds the results of one operation into the next operation below it.

OK. You should be ready to go.

Let’s inspect the data first, you can use the function “glimpse()” of the R package {dplyr}

Code
nimbiosData[1:10, 1:5]
                             400         410         420         430
Tilia_platyphyllos   0.013736170 0.014438395 0.015676539 0.016526365
Lonicera             0.007564401 0.007685526 0.007613202 0.007632988
Inga_alba            0.012211844 0.011661123 0.011313618 0.011105918
Annona_andicola      0.011549314 0.011007158 0.010648428 0.010446585
Guadua_sarcocarpa    0.013791418 0.013306980 0.012992283 0.012803944
Picea_pungens        0.007563849 0.007749505 0.008309924 0.009127325
Cornus_florida       0.006943540 0.006829403 0.006699217 0.006543869
Buchenavia_tomentosa 0.012438235 0.012039175 0.011888195 0.011833785
Crataegus_mollis     0.009663496 0.009470150 0.009422308 0.009499938
Prunus_lusitanica    0.010370193 0.010343476 0.010503806 0.010605137
                             440
Tilia_platyphyllos   0.016946896
Lonicera             0.007581694
Inga_alba            0.010982383
Annona_andicola      0.010354256
Guadua_sarcocarpa    0.012668808
Picea_pungens        0.009779256
Cornus_florida       0.006499184
Buchenavia_tomentosa 0.011811240
Crataegus_mollis     0.009524536
Prunus_lusitanica    0.010670988

Let’s check if our trait data contain the same species as the phylogeny

Code
tmp <- name.check(phy = nimbiosTree, data = nimbiosData)

# print the results
head(tmp$tree_not_data, 10)
 [1] "Abies_concolor"       "Abies_magnifica"      "Acacia_altiscandens" 
 [4] "Acer_negundo"         "Acer_platanoides"     "Acer_pseudoplatanus" 
 [7] "Acer_rubrum"          "Acer_saccharinum"     "Acer_saccharum"      
[10] "Achillea_millefolium"
Code
tmp$data_not_tree
character(0)

It indicates that over 408 species are not present in the spectra data, so let’s drop these species from the phylogeny. To do that we will use the function drop.tip() of the package {ape}

Code
nimbiosTree <- drop.tip(phy = nimbiosTree, tip = tmp$tree_not_data)

# print resulting tree
nimbiosTree

Phylogenetic tree with 136 tips and 135 internal nodes.

Tip labels:
  Ficus_coerulescens, Ficus_caballina, Ficus_americana, Pseudolmedia_macrophylla, Castilla_ulei, Sorocea_briquetii, ...
Node labels:
  NA, NA, NA, NA, NA, NA, ...

Rooted; includes branch lengths.

We can double check if our data match after dropping the missing species

Code
name.check(phy = nimbiosTree, data = nimbiosData)
[1] "OK"

Now it seems that we are ready to go!

4.1 Working with trees

Let’s start by looking at the phylogeny of these birds and learning a bit about how to work with trees in R.

What does your tree look like?

Code
plot(ladderize(nimbiosTree))

axisPhylo()

Whoa. That’s ugly. Let’s clean it up.

Code
plot.phylo(ladderize(nimbiosTree), 
           no.margin = TRUE, 
           cex = 0.5)

axisPhylo()

Better. You can mess around with tree plotting functions in plot.phylo() as much as you’d like. Try this for example:

Code
plot.phylo(nimbiosTree, 
           type = "fan", 
           no.margin = TRUE, 
           cex = 0.5)

Much much better.

It may be useful to understand how trees are encoded in R. Typing in just the name of the tree file like this:

Code
nimbiosTree

Phylogenetic tree with 136 tips and 135 internal nodes.

Tip labels:
  Ficus_coerulescens, Ficus_caballina, Ficus_americana, Pseudolmedia_macrophylla, Castilla_ulei, Sorocea_briquetii, ...
Node labels:
  NA, NA, NA, NA, NA, NA, ...

Rooted; includes branch lengths.

that will give you basic information about the phylogeny: the number of tips and nodes; what the tips are called; whether the tree is rooted; and if it has branch lengths.

Code
str(nimbiosTree)
List of 5
 $ edge       : int [1:270, 1:2] 137 138 139 140 141 142 143 144 145 146 ...
 $ edge.length: num [1:270] 191.138 0.438 32.448 1.735 7.887 ...
 $ Nnode      : int 135
 $ node.label : chr [1:135] "NA" "NA" "NA" "NA" ...
 $ tip.label  : chr [1:136] "Ficus_coerulescens" "Ficus_caballina" "Ficus_americana" "Pseudolmedia_macrophylla" ...
 - attr(*, "class")= chr "phylo"
 - attr(*, "order")= chr "cladewise"

will tell you more about tree structure. Trees consist of tips connected by edges (AKA branches)

Code
nimbiosTree$tip.label
  [1] "Ficus_coerulescens"          "Ficus_caballina"            
  [3] "Ficus_americana"             "Pseudolmedia_macrophylla"   
  [5] "Castilla_ulei"               "Sorocea_briquetii"          
  [7] "Cecropia_membranacea"        "Celtis_occidentalis"        
  [9] "Ulmus_glabra"                "Ceanothus_cordulatus"       
 [11] "Ceanothus_americanus"        "Crataegus_mollis"           
 [13] "Prunus_serotina"             "Prunus_lusitanica"          
 [15] "Prunus"                      "Prunus_pensylvanica"        
 [17] "Rubus_flagellaris"           "Quercus_hemisphaerica"      
 [19] "Quercus_nigra"               "Quercus_palustris"          
 [21] "Quercus_douglasii"           "Quercus_virginiana"         
 [23] "Castanea_sativa"             "Betula_lenta"               
 [25] "Betula_glandulosa"           "Alnus_glutinosa"            
 [27] "Lespedeza_capitata"          "Desmodium_glutinosum"       
 [29] "Trifolium_repens"            "Maackia_amurensis"          
 [31] "Vatairea_fusca"              "Swartzia"                   
 [33] "Swartzia_leptopetala"        "Swartzia_arborescens"       
 [35] "Dipteryx_alata"              "Dussia_tessmannii"          
 [37] "Inga_coruscans"              "Inga_splendens"             
 [39] "Inga_cayennensis"            "Inga_rhynchocalyx"          
 [41] "Inga_alba"                   "Abarema_jupunba"            
 [43] "Schizolobium_parahyba"       "Tachigali_paniculata"       
 [45] "Tachigali_chrysaloides"      "Symphonia_globulifera"      
 [47] "Sloanea_latifolia"           "Euonymus_fortunei"          
 [49] "Acer_spicatum"               "Talisia"                    
 [51] "Rhus"                        "Tetragastris_panamensis"    
 [53] "Protium_amazonicum"          "Septotheca_tessmannii"      
 [55] "Tilia_platyphyllos"          "Theobroma_cacao"            
 [57] "Bixa_arborea"                "Jacaratia_digitata"         
 [59] "Turpinia_occidentalis"       "Eugenia"                    
 [61] "Qualea_tessmannii"           "Buchenavia_tomentosa"       
 [63] "Liquidambar"                 "Liquidambar_styraciflua"    
 [65] "Godmania_aesculifolia"       "Jacaranda_copaia"           
 [67] "Callicarpa_bodinieri"        "Verbena_stricta"            
 [69] "Linaria_vulgaris"            "Fraxinus_excelsior"         
 [71] "Syringa_reticulata"          "Ligustrum_vulgare"          
 [73] "Cordia"                      "Aspidosperma_excelsum"      
 [75] "Aspidosperma_capitatum"      "Tabernaemontana_cymosa"     
 [77] "Asclepias_incarnata"         "Apocynum_cannabinum"        
 [79] "Solidago_flexicaulis"        "Antennaria_neglecta"        
 [81] "Ambrosia_trifida"            "Sonchus"                    
 [83] "Viburnum_prunifolium"        "Viburnum_rhytidophyllum"    
 [85] "Lonicera"                    "Schefflera_arboricola"      
 [87] "Pouteria_guianensis"         "Ecclinusa_lanceolata"       
 [89] "Chrysophyllum_lucentifolium" "Diospyros_pseudoxylopia"    
 [91] "Rhododendron_calophytum"     "Rhododendron_maximum"       
 [93] "Kalmia_latifolia"            "Eschweilera_albiflora"      
 [95] "Bertholletia_excelsa"        "Cariniana_estrellensis"     
 [97] "Cornus_alba"                 "Cornus_florida"             
 [99] "Neea_aeruginosa"             "Triplaris_poeppigiana"      
[101] "Rumex_acetosella"            "Polygonum_convolvulus"      
[103] "Minquartia_guianensis"       "Meliosma_herbertii"         
[105] "Anemone_cylindrica"          "Actaea_pachypoda"           
[107] "Thalictrum_dioicum"          "Abuta_pahnii"               
[109] "Panicum_virgatum"            "Panicum_oligosanthes"       
[111] "Buchloe_dactyloides"         "Dactylis"                   
[113] "Guadua_sarcocarpa"           "Attalea_butyracea"          
[115] "Mauritia_flexuosa"           "Arisaema_triphyllum"        
[117] "Guatteria_schunkevigoi"      "Guatteria_alutacea"         
[119] "Guatteria_acutissima"        "Annona_andicola"            
[121] "Oxandra_major"               "Unonopsis_floribunda"       
[123] "Virola_caducifolia"          "Virola_surinamensis"        
[125] "Nectandra_longifolia"        "Pleurothyrium_williamsii"   
[127] "Ocotea_longifolia"           "Ocotea_oblonga"             
[129] "Sassafras_albidum"           "Asarum_canadense"           
[131] "Pinus_jeffreyi"              "Pinus_banksiana"            
[133] "Picea_pungens"               "Abies_balsamea"             
[135] "Tsuga_canadensis"            "Thuja_occidentalis"         

gives you a list of all your terminal taxa, which are by default numbered 1-n, where n is the number of taxa.

Code
nimbiosTree$Nnode
[1] 135

gives you the number of nodes. This is a fully bifurcating rooted tree, so it has 1 fewer node than the number of taxa.

Code
head(nimbiosTree$edge, 10)
      [,1] [,2]
 [1,]  137  138
 [2,]  138  139
 [3,]  139  140
 [4,]  140  141
 [5,]  141  142
 [6,]  142  143
 [7,]  143  144
 [8,]  144  145
 [9,]  145  146
[10,]  146  147

This tells you the beginning and ending node for all edges.

Put that all together with the following lines

Code
plot.phylo(nimbiosTree, 
           type = "fan", 
           no.margin = TRUE, 
           cex = 0.5, 
           label.offset = 0.1, 
           show.tip.label = TRUE)

nodelabels(cex = 0.3)

Code
#tiplabels(cex = 0.5)

There are many ways to manipulate trees in R using {ape}, {Phytools}, and other packages. This just gives you a bare-bones introduction.

4.2 Working with a data matrix and testing hypotheses in a phylogenetically informed way

Let’s ask some questions using the spectra data that were extracted from NEON. First, explore the data in the “nimbiosData” matrix. Here are some options for visualizing data matrices:

Code
#nimbiosData # this will show you the first few rows of your data matrix and its header

nimbiosData %>% 
  dimnames() # this will show you the row and column headers for your matrix
[[1]]
  [1] "Tilia_platyphyllos"          "Lonicera"                   
  [3] "Inga_alba"                   "Annona_andicola"            
  [5] "Guadua_sarcocarpa"           "Picea_pungens"              
  [7] "Cornus_florida"              "Buchenavia_tomentosa"       
  [9] "Crataegus_mollis"            "Prunus_lusitanica"          
 [11] "Tsuga_canadensis"            "Unonopsis_floribunda"       
 [13] "Neea_aeruginosa"             "Godmania_aesculifolia"      
 [15] "Schefflera_arboricola"       "Quercus_virginiana"         
 [17] "Thalictrum_dioicum"          "Aspidosperma_capitatum"     
 [19] "Linaria_vulgaris"            "Anemone_cylindrica"         
 [21] "Alnus_glutinosa"             "Abarema_jupunba"            
 [23] "Acer_spicatum"               "Bertholletia_excelsa"       
 [25] "Actaea_pachypoda"            "Solidago_flexicaulis"       
 [27] "Tabernaemontana_cymosa"      "Thuja_occidentalis"         
 [29] "Callicarpa_bodinieri"        "Guatteria_schunkevigoi"     
 [31] "Betula_glandulosa"           "Eugenia"                    
 [33] "Trifolium_repens"            "Pleurothyrium_williamsii"   
 [35] "Cecropia_membranacea"        "Ulmus_glabra"               
 [37] "Pouteria_guianensis"         "Panicum_virgatum"           
 [39] "Betula_lenta"                "Ficus_americana"            
 [41] "Ecclinusa_lanceolata"        "Rhus"                       
 [43] "Bixa_arborea"                "Oxandra_major"              
 [45] "Chrysophyllum_lucentifolium" "Rhododendron_maximum"       
 [47] "Ocotea_oblonga"              "Talisia"                    
 [49] "Abies_balsamea"              "Asclepias_incarnata"        
 [51] "Dipteryx_alata"              "Swartzia"                   
 [53] "Verbena_stricta"             "Inga_coruscans"             
 [55] "Ceanothus_cordulatus"        "Celtis_occidentalis"        
 [57] "Swartzia_leptopetala"        "Castanea_sativa"            
 [59] "Tachigali_paniculata"        "Qualea_tessmannii"          
 [61] "Liquidambar"                 "Apocynum_cannabinum"        
 [63] "Prunus_pensylvanica"         "Schizolobium_parahyba"      
 [65] "Triplaris_poeppigiana"       "Ocotea_longifolia"          
 [67] "Aspidosperma_excelsum"       "Swartzia_arborescens"       
 [69] "Ligustrum_vulgare"           "Dactylis"                   
 [71] "Inga_rhynchocalyx"           "Abuta_pahnii"               
 [73] "Rhododendron_calophytum"     "Quercus_palustris"          
 [75] "Pinus_jeffreyi"              "Ficus_caballina"            
 [77] "Guatteria_alutacea"          "Jacaranda_copaia"           
 [79] "Dussia_tessmannii"           "Inga_splendens"             
 [81] "Ambrosia_trifida"            "Viburnum_rhytidophyllum"    
 [83] "Ceanothus_americanus"        "Septotheca_tessmannii"      
 [85] "Cornus_alba"                 "Lespedeza_capitata"         
 [87] "Arisaema_triphyllum"         "Fraxinus_excelsior"         
 [89] "Pinus_banksiana"             "Sonchus"                    
 [91] "Theobroma_cacao"             "Antennaria_neglecta"        
 [93] "Quercus_hemisphaerica"       "Diospyros_pseudoxylopia"    
 [95] "Minquartia_guianensis"       "Protium_amazonicum"         
 [97] "Desmodium_glutinosum"        "Asarum_canadense"           
 [99] "Buchloe_dactyloides"         "Sorocea_briquetii"          
[101] "Virola_surinamensis"         "Jacaratia_digitata"         
[103] "Tachigali_chrysaloides"      "Ficus_coerulescens"         
[105] "Virola_caducifolia"          "Quercus_nigra"              
[107] "Cordia"                      "Nectandra_longifolia"       
[109] "Liquidambar_styraciflua"     "Vatairea_fusca"             
[111] "Eschweilera_albiflora"       "Rubus_flagellaris"          
[113] "Sloanea_latifolia"           "Turpinia_occidentalis"      
[115] "Viburnum_prunifolium"        "Attalea_butyracea"          
[117] "Panicum_oligosanthes"        "Quercus_douglasii"          
[119] "Rumex_acetosella"            "Euonymus_fortunei"          
[121] "Prunus"                      "Syringa_reticulata"         
[123] "Symphonia_globulifera"       "Castilla_ulei"              
[125] "Mauritia_flexuosa"           "Polygonum_convolvulus"      
[127] "Cariniana_estrellensis"      "Inga_cayennensis"           
[129] "Maackia_amurensis"           "Sassafras_albidum"          
[131] "Meliosma_herbertii"          "Pseudolmedia_macrophylla"   
[133] "Kalmia_latifolia"            "Prunus_serotina"            
[135] "Guatteria_acutissima"        "Tetragastris_panamensis"    

[[2]]
  [1] "400"  "410"  "420"  "430"  "440"  "450"  "460"  "470"  "480"  "490" 
 [11] "500"  "510"  "520"  "530"  "540"  "550"  "560"  "570"  "580"  "590" 
 [21] "600"  "610"  "620"  "630"  "640"  "650"  "660"  "670"  "680"  "690" 
 [31] "700"  "710"  "720"  "730"  "740"  "750"  "760"  "770"  "780"  "790" 
 [41] "800"  "810"  "820"  "830"  "840"  "850"  "860"  "870"  "880"  "890" 
 [51] "900"  "910"  "920"  "930"  "940"  "950"  "960"  "970"  "980"  "990" 
 [61] "1000" "1010" "1020" "1030" "1040" "1050" "1060" "1070" "1080" "1090"
 [71] "1100" "1110" "1120" "1130" "1140" "1150" "1160" "1170" "1180" "1190"
 [81] "1200" "1210" "1220" "1230" "1240" "1250" "1260" "1270" "1280" "1290"
 [91] "1300" "1310" "1320" "1330" "1340" "1350" "1360" "1370" "1380" "1390"
[101] "1400" "1410" "1420" "1430" "1440" "1450" "1460" "1470" "1480" "1490"
[111] "1500" "1510" "1520" "1530" "1540" "1550" "1560" "1570" "1580" "1590"
[121] "1600" "1610" "1620" "1630" "1640" "1650" "1660" "1670" "1680" "1690"
[131] "1700" "1710" "1720" "1730" "1740" "1750" "1760" "1770" "1780" "1790"
[141] "1800" "1810" "1820" "1830" "1840" "1850" "1860" "1870" "1880" "1890"
[151] "1900" "1910" "1920" "1930" "1940" "1950" "1960" "1970" "1980" "1990"
[161] "2000" "2010" "2020" "2030" "2040" "2050" "2060" "2070" "2080" "2090"
[171] "2100" "2110" "2120" "2130" "2140" "2150" "2160" "2170" "2180" "2190"
[181] "2200" "2210" "2220" "2230" "2240" "2250" "2260" "2270" "2280" "2290"
[191] "2300" "2310" "2320" "2330" "2340" "2350" "2360" "2370" "2380" "2390"
[201] "2400"
Code
#nimbiosData %>% 
 # View() # this will let you visualize the entire matrix

Hyperspectral data is composed of hundreds of bands let’s isolate a single band to work so we can work with it easily:

Code
wl400 <- nimbiosData[, "400"] 
names(wl400) <- rownames(nimbiosData) 
# data vectors have to be labelled with tip names for the associated tree. 
# This is how to do that. 
Exploring the data

It is good practice to check the distribution of your data before doing downstream analysis.

Code
hist(wl400)

When we work with comparative data we start exploring different sets of evolutionary models. The most common is the model of character evolution, called Brownian Motion model. This model assumes that a trait evolves from a starting state (z0) according to a random walk with the variance specified by the rate parameter \(\sigma^{2}\) (sigma-squared). In short, Brownian motion describes a process in which tip states are modeled under the assumption of a multivariate normal distribution. On a phylogeny, the multivariate mean of tip states is equal to the root state estimate, and variance accumulates linearly through time.

What does Brownian Motion evolution of hand-wing index in Furnariides look like?

Code
brownianModel <- fitContinuous(phy = nimbiosTree, 
                               dat = wl400)

brownianModel # this will show you the fit statistics and parameter values
GEIGER-fitted comparative model of continuous data
 fitted 'BM' model parameters:
    sigsq = 0.000000
    z0 = 0.010422

 model summary:
    log-likelihood = 568.550267
    AIC = -1133.100533
    AICc = -1133.010308
    free parameters = 2

Convergence diagnostics:
    optimization iterations = 100
    failed iterations = 0
    number of iterations with same best fit = 100
    frequency of best fit = 1.000

 object summary:
    'lik' -- likelihood function
    'bnd' -- bounds for likelihood search
    'res' -- optimization iteration summary
    'opt' -- maximum likelihood parameter estimates

Here, you can see the estimates for ancestral state (z0), and the rate parameter (\(\sigma^{2}\)), as well as some measures of model fit. The fit of the model is determined using maximum likelihood, and expressed as a log likelihood (lnL). The higher the lnL, the more probable the data given the model. However, when comparing different models, we can’t use the lnL, because it does not account for the difference in the number of parameters among models. Models with more parameters will always fit better, but do they fit significantly better? For example an OU model has 4 parameters (alpha [\(\alpha\)], theta [\(\theta\)], z0, and sigma-squared [\(\sigma^{2}\)]), so it should fit better than a BM model, which includes only z0 and sigsq. To account for this, statisticians have developed another measure of fit called the AIC (Akaike Information Criterion): AIC = (2xN)-2xlnL, where N is the number of parameters. This penalizes the likelihood score for adding parameters. When selecting among a set of models, the one with the lowest AIC is preferred. We will use this information later on in this lab.

In addition to assessing model fit, we can use the Brownian Motion model to reconstruct ancestral states of a character on a tree. To visualize what BM evolution of this trait looks like on a tree. The contMap() command in {phytools} estimates the ancestral states and plots them on a tree.

Code
## Calculate number of trait shifts
obj <- contMap(nimbiosTree, 
        wl400, 
        fsize = 0.1, 
        lwd = 1.5, 
        type = "fan", 
        plot = FALSE)

# change colors
obj <- setMap(obj, 
              c("white", "#FFFFB2", "#FECC5C", "#FD8D3C", "#E31A1C")) 

# Plot the results
plot(obj, 
     fsize = c(0.1, 0.8), 
     leg.txt = "Wavelength 402")

4.3 Model Fitting

Brownian Motion is only one model of evolution for a continuous variable. Another model is the Ornstein-Uhlenbeck (OU) model, which allows the trait mean to evolve towards a new state (theta), with a selective force (alpha). These two new parameters, plus the starting state (z0) and the rate of evolution (sigsq) parameters from the BM model, make for a 4-parameter model. The Early Burst (EB) model allows the rate of evolution to change across the tree, where the early rate of evolution is high and declines over time (presumably as niches are filled during an adaptive radiation. The rate of evolution changes exponentially over time and is specified under the model r[t] = r[0] x exp(a x t), where r[0] is the initial rate, a is the rate change parameter, and t is time. The maximum bound is set to -0.000001, representing a decelerating rate of evolution. The minimum bound is set to \(log(10^{-5})\)/depth of the tree.

Let’s evaluate the relative fit of these three models to the Hand-wing index trait.

4.3.1 Brownian Motion (BM)

Code
brownianModel <- fitContinuous(phy = nimbiosTree, 
                               dat = wl400, # trait 
                               model = "BM") # evolutionary model

4.3.2 Ornstein-Uhlenbeck (OU)

Code
OUModel <- fitContinuous(phy = nimbiosTree, 
                               dat = wl400, 
                         model = "OU")

4.3.3 Early Burst (EB)

Code
EBModel <- fitContinuous(phy = nimbiosTree, 
                         dat = wl400, 
                         model = "EB")
Warning in fitContinuous(phy = nimbiosTree, dat = wl400, model = "EB"): 
Parameter estimates appear at bounds:
    a

And recover the parameter values and fit estimates.

Code
brownianModel
GEIGER-fitted comparative model of continuous data
 fitted 'BM' model parameters:
    sigsq = 0.000000
    z0 = 0.010422

 model summary:
    log-likelihood = 568.550267
    AIC = -1133.100533
    AICc = -1133.010308
    free parameters = 2

Convergence diagnostics:
    optimization iterations = 100
    failed iterations = 0
    number of iterations with same best fit = 100
    frequency of best fit = 1.000

 object summary:
    'lik' -- likelihood function
    'bnd' -- bounds for likelihood search
    'res' -- optimization iteration summary
    'opt' -- maximum likelihood parameter estimates
Code
OUModel
GEIGER-fitted comparative model of continuous data
 fitted 'OU' model parameters:
    alpha = 0.106864
    sigsq = 0.000002
    z0 = 0.011790

 model summary:
    log-likelihood = 617.661473
    AIC = -1229.322946
    AICc = -1229.141127
    free parameters = 3

Convergence diagnostics:
    optimization iterations = 100
    failed iterations = 0
    number of iterations with same best fit = 23
    frequency of best fit = 0.230

 object summary:
    'lik' -- likelihood function
    'bnd' -- bounds for likelihood search
    'res' -- optimization iteration summary
    'opt' -- maximum likelihood parameter estimates
Code
EBModel
GEIGER-fitted comparative model of continuous data
 fitted 'EB' model parameters:
    a = -0.000001
    sigsq = 0.000000
    z0 = 0.010422

 model summary:
    log-likelihood = 568.548380
    AIC = -1131.096760
    AICc = -1130.914942
    free parameters = 3

Convergence diagnostics:
    optimization iterations = 100
    failed iterations = 0
    number of iterations with same best fit = 10
    frequency of best fit = 0.100

 object summary:
    'lik' -- likelihood function
    'bnd' -- bounds for likelihood search
    'res' -- optimization iteration summary
    'opt' -- maximum likelihood parameter estimates

Compare all models and select the best fitting model. To to that, we will use AIC model comporison approach based on weights.

Code
# Vector of models
mods <- c(brownianModel$opt$aicc, OUModel$opt$aicc, EBModel$opt$aicc)
# rename the models
names(mods) <- c("BM", "OU", "EB")

# Run AIC weights 
aicw(mods)
         fit    delta            w
BM -1133.010 96.13082 1.334928e-21
OU -1229.141  0.00000 1.000000e+00
EB -1130.915 98.22619 4.682252e-22

According to the model comparison using AIC, the model that best fits the data is the OU model. The OU model also shows a slightly higher lnL value compared to the BM and EB models of evolution.

Code
library(knitr)

aic_wl400 <- aicw(mods) 
names(aic_wl400)[1] <- "AIC"

aic_wl400$lnL <- c(brownianModel$opt$lnL, OUModel$opt$lnL, EBModel$opt$lnL)

kable(aic_wl400)
AIC delta w lnL
BM -1133.010 96.13082 0 568.5503
OU -1229.141 0.00000 1 617.6615
EB -1130.915 98.22619 0 568.5484

4.4 Phylogenetic signal

Phylogenetic signal is the tendency of related species to resemble each other more than species drawn at random from the same tree.

4.4.1 Blomberg’s K

Blomberg’s K compares the variance of PICs to what we would expect under a Brownian motion (BM) model of evolution. K = 1 means that close relatives resemble each other as much as we should expect under BM. K < 1 that there is less phylogenetic signal than expected under BM and K > 1 means that there is more. In addition, a significant p-value returned from a randomization test tells us that the phylogenetic signal is significant, in other words, close relatives are more similar than random pairs of taxa in the dataset.

Code
# Run Blomber's K
K_wl400 <- phylosig(tree = nimbiosTree, # Phylogeny
                  x = wl400, # trait
                  method = "K", # method
                  test = TRUE)

# Print results
print(K_wl400)

Phylogenetic signal K : 0.12958 
P-value (based on 1000 randomizations) : 0.006 
Code
# Plot results
plot(K_wl400)

4.4.2 Pagel’s Lambda

Pagel’s \(\lambda\) is a tree transformation that stretches the tip branches relative to internal branches, making the tree more and more like a complete polytomy of a star phylogeny. If \(\lambda = 0\) there is no phylogenetic signal, while \(\lambda = 1\) correspond to BM and \(0 < \lambda < 1\) in between.

Code
# Run Pagel's Lambda
LB_wl400 <- phylosig(tree = nimbiosTree, # Phylogeny
                  x = wl400, # trait
                  method = "lambda", 
                  test = TRUE)

# Print the results
print(LB_wl400)

Phylogenetic signal lambda : 0.797976 
logL(lambda) : 625.571 
LR(lambda=0) : 18.6032 
P-value (based on LR test) : 1.60954e-05 
Code
# Plot thre results
plot(LB_wl400)

Wavelength 402 presents a low phylogenetic signal according to both metrics. Pagel’s \(\lambda\) = 0.8. Blomberg’s K = 0.13 suggests that wavelength 402 presents phylogenetic signal but not as expected under the BM model of evolution. Importantly, both metrics deviate from the white noise null model.

4.5 Phylogenetic signal for all wavelengths

Double check the match between the spectra data and the phylogeny

Code
### Check info
setdiff(nimbiosTree$tip.label, rownames(nimbiosData))
character(0)
Code
### Assign rownames to column
nimbiosData <- nimbiosData %>% 
  rownames_to_column("species")

Now we will use a wrapper function to run multiple wavelenghts, but for the sake of sanity we will just run a sample of 10.

Code
### Run Blomberg's K 
### Load wrapper functions
source("https://raw.githubusercontent.com/jesusNPL/SpecEvolution/main/R/wrappers.R")

# run only 10 bands
NIMBioS_PhyloSig <- demon_K(spectra = nimbiosData, # spectra in data.frame
                            tree = nimbiosTree, # phylogeny
                            nSIM = 1000, # number of simulations
                            nBands = 10) # number of bands
[1] "PS estimated for band 400 without Measurement Error"
[1] "PS estimated for band 410 without Measurement Error"
[1] "PS estimated for band 420 without Measurement Error"
[1] "PS estimated for band 430 without Measurement Error"
[1] "PS estimated for band 440 without Measurement Error"
[1] "PS estimated for band 450 without Measurement Error"
[1] "PS estimated for band 460 without Measurement Error"
[1] "PS estimated for band 470 without Measurement Error"
[1] "PS estimated for band 480 without Measurement Error"
[1] "PS estimated for band 490 without Measurement Error"

Explore the results

Code
glimpse(NIMBioS_PhyloSig)
Rows: 10
Columns: 11
$ Band       <chr> "400", "410", "420", "430", "440", "450", "460", "470", "48…
$ nSIM       <dbl> 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000
$ K_obs      <dbl> 0.12957981, 0.11563354, 0.10837909, 0.10087738, 0.09677001,…
$ K_pval     <dbl> 0.003, 0.003, 0.008, 0.014, 0.025, 0.029, 0.024, 0.024, 0.0…
$ K_sim_mean <dbl> 0.05533727, 0.05367654, 0.05324792, 0.05320002, 0.05416178,…
$ K_sim_SD   <dbl> 0.01989555, 0.01719941, 0.01722522, 0.01740005, 0.01819388,…
$ K_sim_SES  <dbl> 3.731616, 3.602275, 3.200608, 2.740071, 2.341899, 2.221665,…
$ K_BM_pval  <dbl> 0.000999001, 0.000999001, 0.000999001, 0.000999001, 0.00099…
$ K_BM_mean  <dbl> 0.9625519, 1.0212397, 0.9907225, 1.0305529, 1.0186125, 0.99…
$ K_BM_SD    <dbl> 0.6433328, 0.7548530, 0.6607160, 0.7249215, 0.6998070, 0.67…
$ K_BM_SES   <dbl> -1.294776, -1.199712, -1.335435, -1.282450, -1.317281, -1.3…

To finish this hands-on we can plot the spectra and Blomberg’s K values. To do that we will upload the phylogenetic signal estimated using the entire dataset.

Code
## Transform the long format
NIMBioS_spectra_long <- nimbiosData %>% 
  pivot_longer(!species, 
               names_to = "wavelength", 
               values_to = "reflectance") 

glimpse(NIMBioS_spectra_long)
Rows: 27,336
Columns: 3
$ species     <chr> "Tilia_platyphyllos", "Tilia_platyphyllos", "Tilia_platyph…
$ wavelength  <chr> "400", "410", "420", "430", "440", "450", "460", "470", "4…
$ reflectance <dbl> 0.01373617, 0.01443839, 0.01567654, 0.01652636, 0.01694690…

Prepare Phylogenetic signal results for plotting

Code
## Load phylogenetic signal
NIMBioS_PhyloSig <- read_csv("../output/NIMBios_K_phylosig.csv")
Rows: 201 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (11): Band, nSIM, K_obs, K_pval, K_sim_mean, K_sim_SD, K_sim_SES, K_BM_p...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
NIMBioS_PhyloSig <- NIMBioS_PhyloSig %>% 
  mutate(Pval_cat = if_else(K_pval <= 0.05, "diff", "no_diff")) %>% 
  mutate(wavelength = Band) %>% 
  mutate(species = Pval_cat) %>% 
  select(Band, wavelength, Pval_cat, species, everything()) %>% 
# Rename P-Values column to "species" to match column in the spectra (only for plot).
  mutate(extra_y = 0.15, "Blomberg's K" = K_obs) 

Plot the results

Code
### Plot mean species and Blomberg's K values
library(ggnewscale)

p_ps_K <- NIMBioS_spectra_long %>% 
  ggplot(aes(x = as.numeric(wavelength), 
             y = reflectance, 
             colour = factor(species)
  )) +
  geom_line(linewidth = 0.5, alpha = 0.5, show.legend = FALSE) + 
  scale_colour_viridis_d(option = "turbo", direction = -1) + 
  # add a new scale color
  new_scale_color() + 
  # plot phylogenetic signal 
  geom_point(aes(x = wavelength, 
                 y = extra_y, 
                 color = species, 
                 size = `Blomberg's K`), # Pval_cat = species
             data = NIMBioS_PhyloSig, 
             inherit.aes = FALSE) + 
  scale_color_manual(name = "P-value", 
                     labels = c("< 0.05", "> 0.05"), 
                     values = c("no_diff" = "black", "diff" = "red")) + 
  guides(color = guide_legend(override.aes = list(size = 5))) + 
  xlab("Wavelength (nm)") + 
  scale_y_continuous(
    # Features of the first axis
    name = "Reflectance",
    # Add a second axis and specify its features
    # sec.axis = sec_axis(trans = ~.*1, name = "Blomberg's K")
  ) + 
  annotate("rect", 
           xmin = 400, xmax = 700, # Visible
           ymin = -Inf, ymax = Inf,  fill = "grey2", alpha = 0.1) + 
  annotate("text", x = 559:560, y = 0.2, label = "VIS", size  = 7) + 
  annotate("rect",
           xmin = 800, xmax = 1340, # NIR
           ymin = -Inf, ymax = Inf,  fill = "grey2", alpha = 0.1) + 
  annotate("text", x = 1070:1071, y = 0.2, label = "NIR", size  = 7) + 
  annotate("rect",
           xmin = 1445, xmax = 1790, # SWIR 1
           ymin = -Inf, ymax = Inf,  fill = "grey2", alpha = 0.1) + 
  annotate("text", x = 1620:1621, y = 0.2, label = "SWIR1", size  = 7) + 
  annotate("rect",
           xmin = 1955, xmax = 2400, # SWIR 2 
           ymin = -Inf, ymax = Inf,  fill = "grey2", alpha = 0.1) +  
  annotate("text", x = 2170:2171, y = 0.2, label = "SWIR2", size  = 7) + 
  theme_minimal() + 
  theme(
    legend.position = "right",
    legend.text = element_text(size = 18), 
    legend.title = element_text(size = 20), 
    axis.text = element_text(size = 17, colour = "black"),
    axis.title = element_text(size = 20), 
    plot.title = element_text(size = 20, vjust = -12, hjust = 0.05),
    plot.subtitle = element_text(size = 20, vjust = -10, hjust = 0.05)
  )

p_ps_K

Save the figure in our output folder

Code
### Save results
pdf(file = "../output/Fig_NIMBioS_K_phyloSignal.pdf", 
    height = 7, width = 10)

p_ps_K

dev.off()
quartz_off_screen 
                2